Core: Replace Set with Bitmap to make delete filtering simpler and faster#3535
Core: Replace Set with Bitmap to make delete filtering simpler and faster#3535rdblue merged 2 commits intoapache:masterfrom
Conversation
| @@ -42,4 +42,9 @@ public void delete(long posStart, long posEnd) { | |||
| public boolean deleted(long position) { | |||
There was a problem hiding this comment.
Is there a reason why we didn't use isDeleted? That is the normal convention for boolean methods.
There was a problem hiding this comment.
Made the change in the new commit.
| @@ -125,12 +100,6 @@ public static PositionDeleteIndex toPositionBitmap(CloseableIterable<Long> posDe | |||
| } | |||
There was a problem hiding this comment.
Should this method be named toPositionDeleteIndex? The way PositionDeleteIndex is modeled, it could be implemented by a set.
|
@flyrain, the data on #3287 supports the idea of using a bitmap instead of a set of long values. But this PR actually removes the streaming delete filter, which doesn't load all of the deletes into memory at the same time. Instead, it streams through the delete file to avoid keeping deletes in memory. Can you provide data comparing the memory consumption and performance of streaming deletes? |
rdblue
left a comment
There was a problem hiding this comment.
I think this needs further discussion. Although there is evidence that a bitmap is better for memory consumption than set of positions, the effect of this PR is actually to remove the more memory efficient streaming delete filter.
I think that this needs to either only replace set with bitmap (as in the summary) or we need more justification that removing the streaming version is also a good idea.
|
Thanks @rdblue for the review. Sorry for the delayed update. I agree we need more justification to remove streaming version. I will try to create another one with perf test for that. I'll update this PR to replace set with bitmap, and the refactor you mentioned for the class |
87bd108 to
9edb689
Compare
|
It'd be nice to get into 0.13 release since it has the change of API |
There was a problem hiding this comment.
Mostly look good to me, while waiting for the performance test results.
There are quite a few interface changes for the public Deletes class, ideally I think we should just preserve the names or at least mark the old one as deprecated for one release, but I don't know if there is really any place that is depending on this. I feel it's public more for the purpose of the ease of use.
|
|
||
| @Override | ||
| public boolean deleted(long position) { | ||
| public boolean isDeleted(long position) { |
There was a problem hiding this comment.
It is suggested by @rdblue. Make sense to me as well. It is less ambiguous.
|
Thanks @jackye1995 for the review. Made the change per your suggestion. To be clear, there is no need of perf test in this PR. I will open another PR for replacing the streaming part, which needs the perf test. I agreed with you about the interface. To add to it, the interface |
|
Hi @rdblue and @jackye1995, Happy new yea! Do you get a chance to take a look this PR? |
|
Thanks, @flyrain! |
|
Thanks @rdblue and @jackye1995 for the review! |
This PR replaces
Set<Long>withRoaring64Bitmapto store the deleted positions for filtering. Bitmap use significant less memory than a set, here is a benchmark. With that, we don't have to use streaming filter in case of large amount of positions, we can keep all deleted positions in memory.cc @aokolnychyi @rdblue @RussellSpitzer @karuppayya @szehon-ho